Abstract

In his keynote speech at CHES 2004, Kocher argued that side-channel attacks illustrate that
formal cryptography is not as secure as it was believed to be, because some assumptions
(e.g., no auxiliary information is available during the computation) were not modeled. This
failure is caused by formal methods' focus on models rather than implementations. In this
paper we present formal methods and tools for designing protected code and proving its security
against power analysis. These formal methods avoid the discrepancy between the model and
the implementation by working on the latter rather than on a high-level model. Indeed, our
methods allow us (a) to automatically insert a power balancing countermeasure directly at the
assembly level, and to prove the correctness of the induced code transformation; and (b) to
prove that the obtained code is balanced with regard to a reasonable leakage model. We also
show how to characterize the hardware in order to use the resources which maximize the relevancy of
the model. The tools implementing our methods are then demonstrated in a case study on an
8-bit AVR smartcard for which we generate a provably protected PRESENT implementation that
proves to be at least 250 times more resistant to CPA attacks.

Keywords. Dual-rail with Precharge Logic (DPL), formal proof, static analysis, symbolic execution,
implementation, DPA, CPA, smartcard, PRESENT, block cipher, Hamming distance, OCaml.


1 Introduction 

The need to trust code is a clear and proved fact, but the code itself needs to be proved before it can
be trusted. In applications such as cryptography or real-time systems, formal methods are used to
prove functional properties on the critical parts of the code. Specifically in cryptography, some
non-functional properties are also important, but are not typically certified by formal proofs yet. One
example of such a property is the resistance to side-channel attacks. Side-channel attacks are a
real-world threat to cryptosystems; they exploit auxiliary information gathered from implementations
through physical channels such as power consumption, electromagnetic radiation, or time, in order
to extract sensitive information (e.g., secret keys). The amount of leaked information depends on
the implementation and as such appears difficult to model. As a matter of fact, physical leakages are
usually not modeled when it comes to proving the security properties of a cryptographic algorithm.
By applying formal methods directly to implementations we can avoid the discrepancy between the
model and the implementation. Formally proving non-functional security properties then becomes
a matter of modeling the leakage itself. In this chapter we make a first step towards formally
trustable cryptosystems, including for non-functional properties, by showing that modeling leakage
and applying formal methods to implementations is feasible.

Many existing countermeasures against side-channel attacks are implemented at the hardware
level, especially for smartcards. However, software-level countermeasures are also very important,
not only in embedded systems where the hardware cannot always be modified or updated, but also
in the purely software world. For example, Zhang et al. [ZJRR12] recently extracted private keys
using side-channel attacks against a target virtual machine running on the same physical server as
their virtual machine. Side channels in software can also be found each time there are some
non-logical behaviors (in the sense that they do not appear in the equations / control-flow modeling the
program) such as timing or power consumption (refer to [KJJ99]), but also some software-specific
information such as packet size for instance (refer to [MO12]).

In many cases where the cryptographic code is executed on secure elements (smartcards, TPM,
tokens, etc.) side-channel and fault analyses are the most natural attack paths. A combination of
signal processing and statistical techniques on the data obtained by side-channel analysis makes it
possible to build distinguishers between key hypotheses. The protection against those attacks is
necessary to ensure that secrets do not leak, and most secure elements are thus evaluated against
those attacks. Usual certifications are the Common Criteria (ISO/IEC 15408), the FIPS 140-2
(ISO/IEC 19790), or proprietary schemes (EMVCo, CAST, etc.).

Power analysis. Power analysis is a form of side-channel attack in which the attacker measures the
power consumption of a cryptographic device. Simple Power Analysis (SPA) consists in directly
interpreting the electrical activity of the cryptosystem. On unprotected implementations it can for
instance reveal the path taken by the code at branches even when timing attacks [KJJ96] cannot.
Differential Power Analysis (DPA) [KJJ99] is more advanced; the attacker can compute the
intermediate values within cryptographic computations by statistically analyzing data collected from
multiple cryptographic operations. It is powerful in the sense that it does not require a precise
model of the leakage, and thus works blind, i.e., even if the implementation is a black box. As
suggested in the original DPA paper by Kocher et al. [KJJ99], power consumption is often modeled
by the Hamming weight of values or the Hamming distance of values' updates, as those are highly
correlated with actual measurements. Also, when the leakage has little noise and the implementation
is in software, Algebraic Side-Channel Attacks (ASCA) [RS09] are possible; they consist in modeling
the leakage by a set of Boolean equations, where the key bits are the only unknown variables [CFGR12].

Thwarting side-channel analysis is a complicated task, since an unprotected implementation
leaks at every step. Simple and powerful attacks manage to exploit any bias. In practice, there
are two ways to protect cryptosystems: “palliative” versus “curative” countermeasures. Palliative
countermeasures attempt to make the attack more difficult, however without a theoretical
foundation. They include variable clock, operations shuffling, and dummy encryptions among others (see
also [GM11]). The lack of theoretical foundation makes these countermeasures hard to formalize
and thus not suitable for a safe certification process. Curative countermeasures aim at providing
a leak-free implementation based on a security rationale. The two defense strategies are (a) make
the leakage as decorrelated from the manipulated data as possible (masking [MOP06, Chp. 9]), or
(b) make the leakage constant, irrespective of the manipulated data (hiding or balancing [MOP06,
Chp. 7]).

Masking. Masking mixes the computation with random numbers, to make the leakage (at least on
average) independent of the sensitive data. Advantages of masking are (a priori) the independence
with respect to the leakage behavior of the hardware, and the existence of provably secure masking
schemes [RP10]. There are two main drawbacks to masking. First of all, there is the possibility of
high-order attacks (that examine the variance or the joint leakage); when the noise is low, ASCAs
can be carried out on one single trace [RSVC09], despite the presence of the masks, which are
just seen as more unknown variables, in addition to the key. Second, masking is greedy for
randomness (which is very costly to generate). Another concern with masking is
the overhead it incurs in the computation time. For instance, a provable masking of AES-128 is
reported in [RP10] to be 43 (resp. 90) times slower than the non-masked implementation with a
1st (resp. 2nd) order masking scheme. Further, recent studies have shown that masking cannot be
analyzed independently from the execution platform: for example glitches are transient leakages
that are likely to depend on more than one sensitive datum, and hence are high-order [MS06]. Indeed, a
glitch occurs when there is a race between two signals, i.e., when it involves more than one sensitive
variable. Additionally, the implementation must be carefully scrutinized to check for the absence
of demasking caused by overwriting a masked sensitive variable with its mask.


Balancing. Balancing requires a close collaboration between the hardware and the software: two
indistinguishable resources, from a side-channel point of view, shall exist and be used according to
a dual-rail protocol. Dual-rail with Precharge Logic (DPL) consists in precharging both resources,
so that they are in a common state, and then setting one of the resources. Which resource has
been set is unknown to the attacker, because both leak in indistinguishable ways (by hypothesis).
This property is used by the DPL protocol to ensure that computations can be carried out without
exploitable leakage [TV06].


Contributions. Dual-rail with Precharge Logic (DPL) is a simple protocol that may look easy
to implement correctly; however, in the current context of awareness about cyber-threats, it
becomes evident that (independent) formal tools that are able to generate and verify a “trusted”
implementation have a strong value.

• We describe a design method for developing balanced assembly code by making it obey
the DPL protocol. This method consists in automatically inserting the countermeasure and
formally proving that the induced code transformation is correct (i.e., semantics-preserving).

• We present a formal method (using symbolic execution) to statically prove the absence of 
power consumption leakage in assembly code provided that the hardware it runs on satisfies 
a finite and limited set of requirements corresponding to our leakage model. 

• We show how to characterize the hardware to run the DPL protocol on resources which 
maximize the relevancy of the leakage model. 

• We provide a tool called paioli¹ which implements the automatic insertion of the DPL
countermeasure in assembly code, and, independently, is able to statically prove the power balancing
of a given assembly code.

• Finally, we demonstrate our methods and tool in a case study on a software implementation
of the PRESENT [BKL+07] cipher running on an 8-bit AVR micro-controller. Our practical
results are very encouraging: the provably balanced DPL protected implementation is at
least 250 times more resistant to power analysis attacks than the unprotected version while
being only 3 times slower. The Signal-to-Noise Ratio (SNR) of the leakage is divided by
approximately 16.

¹ http://pablo.rauzy.name/sensi/paioli.html


Related work. The use of formal methods is not widespread in the domain of implementation
security. In cases where they exist, security proofs are usually done on mathematical models
rather than implementations. An emblematic example is the Common Criteria [Con13], which bases
its “formal” assurance evaluation levels on “Security Policy Model(s)” (class SPM) and not on
implementation-level proofs. This means that it is the role of the implementers to ensure that their
implementations fit the model, which is usually done by hand and is thus error-prone. For instance,
some masking implementations have been proved; automatic tools for the insertion of masked code
have even been prototyped [MOPT12]. However, masking relies a lot on randomness, which is a rare
resource and is hard to capture formally. Thus, many aspects of the security are actually displaced
into the randomness requirement rather than soundly proved. Moreover, in the field of masking, most
proofs are still literate (i.e., verified manually, not by a computer program). This has led to a recent
security breach in what was supposed to be a proved [RP10] masking implementation [CGP+12].
Previous similar examples exist, e.g., the purported high-order masking scheme [SP06], defeated
one year later in [CPR07].

Timing and cache attacks are an exception as they benefit from the work of Köpf et al. [KB07,
KD09]. Their tool, CacheAudit [DFK+13], implements formal methods that directly work on x86
binaries.

Since we started our work on DPL, others have worked on similar approaches. Independently,
it has been shown that SNR reduction is possible with other encodings that are less costly, such
as “dual-nibble” (Chen et al. [CESY14]) or “m-out-of-n” (Servant et al. [SDMB14]). However, it
becomes admittedly much more difficult to balance the resources aimed at hiding each other.
Thus, there is a trade-off between performance (in terms of execution speed and code size) and
security. In this chapter we propose a proof-of-concept of maximal security.

In this light it is easy to conclude that the use of formal methods to prove the security of 
implementations against power analysis is a need, and a technological enabler: it would guarantee 
that the instantiations of security principles are as strong as the security principles themselves. 


Organization of the chapter. The DPL countermeasure is studied in Sec. 2. Sec. 3 details our
method to balance assembly code and prove that the proposed transformation is correct. Sec. 4
explains the formal methods used to compute a proof of the absence of power consumption leakage.
A practical case study using the PRESENT algorithm on an AVR micro-controller follows, and
conclusions and perspectives close the chapter.


2 Dual-rail with Precharge Logic 

Balancing (or hiding) countermeasures have been employed against side channels since early 2004,
with dual-rail with precharge logic. The DPL countermeasure consists in computing on a redundant
representation: each bit $y$ is implemented as a pair $(y_\text{False}, y_\text{True})$. The bit pair is then used in a
protocol made up of two phases:

1. a precharge phase, during which all the bit pairs are zeroized, $(y_\text{False}, y_\text{True}) = (0, 0)$, such that
the computation starts from a known reference state;

2. an evaluation phase, during which the $(y_\text{False}, y_\text{True})$ pair is equal to (1,0) if it carries the
logical value 0, or (0,1) if it carries the logical value 1.

The value $(y_\text{False}, y_\text{True}) = (1, 1)$ is unused. As suggested in [MAM+03], it can serve as a canary to
detect a fault. Besides, if a fault turns a (1,0) or (0,1) value into either (0,0) or (1,1), then the
previous functional value has been forgotten. It is a type of infection, already mentioned in [IPSW06,
SBG+09]. Unlike other infective countermeasures, DPL is not scary [BG13], in that it consists in an
erasure. Indeed, the mutual information between the erased and the initial data is zero (provided
only one bit out of a dual pair is modified).

Figure 1: Four dual-rail with precharge logic styles.
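The two phases of the protocol can be illustrated with a minimal Python sketch. The helper names (`dpl_encode`, `dpl_decode`, `PRECHARGE`, `CANARY`) are hypothetical and chosen for this illustration only; the pair order (false rail, true rail) follows the convention used above.

```python
# Illustrative sketch of the DPL bit-pair protocol; all names are hypothetical.
PRECHARGE = (0, 0)   # common reference state reached during the precharge phase
CANARY = (1, 1)      # unused state, usable to detect a fault

def dpl_encode(bit):
    """Evaluation-phase pair (y_false, y_true) carrying a logical bit."""
    return (1, 0) if bit == 0 else (0, 1)

def dpl_decode(pair):
    """Return the logical bit, or None for the precharge and unused states."""
    if pair == (1, 0):
        return 0
    if pair == (0, 1):
        return 1
    return None  # (0, 0) or (1, 1): the functional value is erased

def hamming_distance(p, q):
    """Number of rails that differ between two pairs."""
    return sum(x != y for x, y in zip(p, q))

# Going from precharge to evaluation flips exactly one rail, whatever the bit.
flips = [hamming_distance(PRECHARGE, dpl_encode(bit)) for bit in (0, 1)]
```

Under the indistinguishability hypothesis on the two rails, the list `flips` being constant is what makes the transition uninformative for an attacker observing the number of bit flips.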

2.1 State of the Art 

Various DPL styles for electronic circuits have been proposed. Some of them, implementing the
same logical and functionality, are represented in Fig. 1; many more variants exist, but these four
are enough to illustrate our point. The reason for the multiplicity of styles is that the
indistinguishability hypothesis on the two resources holding the $y_\text{False}$ and $y_\text{True}$ values happens to be violated for
various reasons, which leads to the development of dedicated hardware. A first asymmetry comes
from the gates driving $y_\text{False}$ and $y_\text{True}$. In Wave Dynamic Differential Logic (WDDL) [TV04a],
these two gates are different: logical or versus logical and. Other logic styles are balanced in
this respect. Then, the load of the gates shall also be similar. This can be achieved by careful
place-and-route constraints [TV04b, GHMP05], which take care of having lines of the same length
that furthermore do not interfere with one another (a phenomenon called “crosstalk”). As
those are complex to implement exactly for all secure gates, the Masked Dual-rail with Precharge
Logic (MDPL) [PM05] style has been proposed: instead of balancing exactly the lines carrying
$y_\text{False}$ and $y_\text{True}$, those are randomly swapped, according to a random bit, represented as a pair
$(m_\text{False}, m_\text{True})$ to prevent it from leaking. Therefore, in this case, not only are the computing gates
the same (viz. a majority), but the routing is balanced thanks to the extra mask. However, it
appeared that another asymmetry could be fatal to WDDL and MDPL: the gates of a pair could evaluate
at different times, depending on their inputs. It is important to mention that side-channel
acquisitions are very accurate in timing (off-the-shelf oscilloscopes can sample at more than 1 Gsample/s,
i.e., at a higher rate than the clock frequency), but very inaccurate in space (i.e., it is difficult to
capture the leakage of an area smaller than about 1 mm² without also recording the leakage from
the surrounding logic). Therefore, two bits can hardly be measured separately. To avoid this issue,
every gate has to include some synchronization logic. In Fig. 1, the “computation part” of the gates
is represented in a grey box. The rest is synchronization logic. In SecLib [GCS+08], the
synchronization can be achieved by Muller C-elements (represented with a symbol C [SEE98]), which act as
a decoding of the input configuration. Another implementation, Balanced Cell-based Differential
Logic (BCDL) [NBD+10], parallelizes the synchronization with the computation.

2.2 DPL in Software 

In this chapter, we want to run DPL on an off-the-shelf processor. Therefore, we must: (a) identify
two similar resources that can hold true and false values in an indiscernible way for a side-channel
attacker; (b) play the DPL protocol by ourselves, in software. We will deal with the former in
Sec. 4.2. The rest of this section deals with the latter.

The difficulty of balancing the gates in hardware implementations is reduced in software.
Indeed, in software there are fewer resources than the thousands of gates that can be found in hardware
(aimed at computing fast, with parallelism). Also, there is no such problem as early evaluation,
since the processor executes one instruction after the other; therefore there are no unbalanced paths
in timing. However, as noted by Hoogvorst et al. [HDD11], standard micro-processors cannot be
used as is for our purpose: instructions may clobber the destination operand without precharge;
arithmetic and logic instructions generate numbers of 1s and 0s which depend on the data.

Reproducing the DPL protocol in software requires (a) working at the bit level, and (b) duplicating
(in positive and negative logic) the bit values. Every algorithm can be transformed so that all the
manipulated values are bits (by the theorem of equivalence of universal Turing machines), so (a) is
not a problem. Regarding (b), the idea is to use two bits in each register / memory cell to represent
the logical value it holds. For instance, using the two least significant bits, the logical value
1 could be encoded as 1 (01) and the logical value 0 as 2 (10). Then, any function on those bit
values can be computed by a look-up table indexed by the concatenation of its operands. Each
sensitive instruction can be replaced by a DPL macro which does the necessary precharge and fetches
the result from the corresponding look-up table.

Fig. 2 shows a DPL macro for the computation of d = a op b, using the two least significant bits
for the DPL encoding. The register r0 is an always-zero register, a and b hold one DPL encoded bit,
and op is the address in memory of the look-up table for the op operation.

This DPL macro assumes that before it starts the state of the program is a valid DPL state
(i.e., that a and b are of the form /.+(01|10)/²) and leaves it in a valid DPL state, to make the
macros chainable.

² As a convenience, we use regular expression notation.

The precharge instructions (like r1 ← r0) erase the content of their destination register or
memory cell before use. If the erased datum is sensitive it is DPL encoded, thus the number of bit
flips (i.e., the Hamming distance of the update) is independent of the sensitive value. If the erased
value is not sensitive (for example the round counter of a block cipher) then the number of bit flips
is irrelevant. In both cases the power consumption provides no sensitive information.

The activity of the shift instructions (like r1 ← r1 ≪ 1) is twice the number of DPL encoded bits
in r1 (and thus does not depend on the value when it is DPL encoded). The two most significant
bits are shifted out and must be 0, i.e., they cannot encode a DPL bit. The logical or instruction
(as in r1 ← r1 ∨ r2) has a constant activity of one bit flip due to the alignment of its operands.
The logical and instructions (like r1 ← r1 ∧ 3) flip as many bits as there are 1s after the two least
significant bits (normally they are all zeros).

Accesses from/to the RAM (as in r3 ← op[r1]) cause as many bit flips as there are 1s in the
transferred data, which is constant when DPL encoded. Of course, the position of the look-up table
in the memory is also important. In order not to leak information during the addition of the offset
(op + r1 in our example), op must be a multiple of 16 so that its four least significant bits are 0
and the addition only flips the number of bits at 1 in r1, which is constant since at this moment r1
contains the concatenation of two DPL encoded bit values.
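The instruction-by-instruction reasoning above can be replayed with a small Python simulation. This is only a sketch under stated assumptions: the instruction sequence is reconstructed from the description in the text (precharge, copy, mask with 3, two shifts, or, table access), registers are assumed to be 8 bits wide and cleared at the start, unused table entries are assumed to hold 00, and the address addition op + r1 is abstracted as a list index. All function names are hypothetical.

```python
def enc(bit):
    """DPL-encode a logical bit on the two least significant bits."""
    return 0b01 if bit else 0b10

def make_table(f):
    """16-entry look-up table indexed by the concatenation of two encoded bits."""
    table = [0b00] * 16  # assumption: DPL-invalid filler at unused indexes
    for a in (0, 1):
        for b in (0, 1):
            table[(enc(a) << 2) | enc(b)] = enc(f(a, b))
    return table

def hamming_distance(x, y):
    return bin(x ^ y).count("1")

def dpl_macro(table, a, b):
    """Simulate d = a op b; return (d, bit flips of each register update)."""
    regs = {"r1": 0, "r2": 0, "r3": 0, "d": 0}  # assumption: cleared at start
    flips = []

    def write(reg, value):
        value &= 0xFF  # 8-bit registers
        flips.append(hamming_distance(regs[reg], value))
        regs[reg] = value

    write("r1", 0)                          # r1 <- r0 (precharge)
    write("r1", a)                          # r1 <- a
    write("r1", regs["r1"] & 3)             # r1 <- r1 and 3
    write("r1", regs["r1"] << 1)            # r1 <- r1 << 1
    write("r1", regs["r1"] << 1)            # r1 <- r1 << 1
    write("r2", 0)                          # r2 <- r0 (precharge)
    write("r2", b)                          # r2 <- b
    write("r2", regs["r2"] & 3)             # r2 <- r2 and 3
    write("r1", regs["r1"] | regs["r2"])    # r1 <- r1 or r2
    write("r3", 0)                          # r3 <- r0 (precharge)
    write("r3", table[regs["r1"]])          # r3 <- op[r1]
    write("d", 0)                           # d <- r0 (precharge)
    write("d", regs["r3"])                  # d <- r3
    return regs["d"], flips

AND = make_table(lambda a, b: a & b)
results = {(a, b): dpl_macro(AND, enc(a), enc(b)) for a in (0, 1) for b in (0, 1)}
traces = {tuple(flips) for _, flips in results.values()}
```

In this model the set `traces` contains a single element: the per-instruction number of bit flips is the same for the four possible operand pairs, which is the property the paragraphs above argue for.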

We could use other bits to store the DPL encoded value, for example the least and the third
least significant bits. In this case a and b would have to be of the form /.+(0.1|1.0)/, only one shift
instruction would be necessary, and the and instructions' mask would be 5 instead of 3.
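The alternative placement of the encoded bits can be checked with a short Python sketch. The names `enc_alt`, `index_alt`, and `make_table_alt` are hypothetical, and the choice of 000 as filler for unused entries is an assumption made for this illustration.

```python
def enc_alt(bit):
    """Encode on the least and third least significant bits: 1 -> 001, 0 -> 100."""
    return 0b001 if bit else 0b100

def index_alt(a, b):
    """Interleave two encoded operands with a single shift and the mask 5."""
    return ((a & 5) << 1) | (b & 5)

def make_table_alt(f):
    table = [0b000] * 16  # assumption: invalid filler at unused indexes
    for a in (0, 1):
        for b in (0, 1):
            table[index_alt(enc_alt(a), enc_alt(b))] = enc_alt(f(a, b))
    return table

indexes = sorted(index_alt(enc_alt(a), enc_alt(b)) for a in (0, 1) for b in (0, 1))
OR_ALT = make_table_alt(lambda a, b: a | b)
```

The four valid operand pairs still map to four distinct 4-bit indexes, each with exactly two bits set, so the look-up table keeps its 16 entries while one shift instruction is saved.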


3 Generation of DPL Protected Assembly Code 


Here we present a generic method to protect assembly code against power analysis. To achieve
that we implemented a tool (see the appendix) which transforms assembly code to make it compliant
with the DPL protocol described in Sec. 2.2. To be as universal as possible the tool works with a
generic assembly language presented in Sec. 3.1. The details of the code transformation are given
in Sec. 3.2. Finally, a proof of the correctness of this transformation is presented in Sec. 3.3.

We implemented paioli using the OCaml³ programming language, whose type safety helps to
prevent many bugs. On our PRESENT case study, it runs in negligible time (≪ 1 second), both for
DPL transformation and simulation, including balance verification. The unprotected (resp. DPL)
bitslice AVR assembly file consists of 641 (resp. 1456) lines of code. We use nibble-wise jumps in
each PRESENT operation, and an external loop over all rounds.


3.1 Generic Assembly Language 

Our assembly language is generic in that it uses a restricted set of instructions that can be mapped
to and from virtually any actual assembly language. It has the classical features of assembly
languages: logical and arithmetical instructions, branching, labels, direct and indirect addressing.
Fig. 3 gives the Backus-Naur Form (BNF) of the language while Fig. 4 gives the equivalent code
of Fig. 2 as an example of its usage.

The semantics of the instructions are intuitive. For Opcode2 and Opcode3 the first operand is
the destination and the others are the arguments. The mov instruction is used to copy registers,
load a value from memory, or store a value to memory depending on the form of its arguments. We
remark that the instructions use the “instr dest op1 op2” format, which makes it possible to map similar
instructions from 32-bit processors directly, as well as instructions from 8-bit processors which only
have two operands, by using the same register for dest and op1 for instance.

³ http://ocaml.org/

Figure 3: Generic assembly syntax (BNF).

3.2 Code Transformation 

Bitsliced code. As seen in Sec. 2.2, DPL works at the bit level. Transforming code to make it
DPL compliant thus requires this level of granularity. Bitslicing is possible on any algorithm⁴, but
we found that bitslicing an algorithm is hard to do automatically. In practice, every bitsliced
implementation we found was hand-crafted. However, since Biham presented his bitslice paper [Bih97],
many block ciphers have been implemented in bitslice for performance reasons, which mitigates this
concern. So, for the sake of simplicity, we assume that the input code is already bitsliced.

DPL macros expansion. This is the main point of the transformation of the code.

Definition 1 (Sensitive value). A value is said to be sensitive if it depends on sensitive data. Sensitive
data depend on the secret key or the plaintext⁵.

⁴ Intuitively, the proof invokes the Universal Turing Machines equivalence (those that work with only {0,1} as
alphabet are as powerful as the others).

⁵ Other works consider that sensitive data must depend on both the secret key and the plaintext (as is usually
admitted in the “only computation leaks” paradigm; see for instance [RP10, §4.1]). Our definition is broader; in
particular it also encompasses the random probing model [ISW03].

Table 1: Look-up tables for and, or, and xor.

Definition 2 (Sensitive instruction). We say that an instruction is sensitive if it may modify the
Hamming weight of a sensitive value.

All the sensitive instructions must be expanded to a DPL macro. 

Thus, all the sensitive data must be transformed too. Each literal (“immediate” values in assembly
terms), memory cells that contain initialized constant data (look-up tables, etc.), and register
values need to be DPL encoded. For instance, using the two least significant bits, the 1s stay
1s (01) and the 0s become 2s (10).

Since the implementation is bitsliced, only the logical (bit level) operators are used in sensitive
instructions (and, or, xor, lsl, lsr, and not). To respect the DPL protocol, not instructions are
replaced by xor instructions which swap the positive logic and the negative logic bits of DPL
encoded values. For instance, if using the two least significant bits for the DPL encoding,
not a b is replaced by xor a b #3. Bitsliced code never needs to use shift instructions since
all bits are directly accessible.
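The replacement of not by xor can be checked in a few lines of Python. The function name `dpl_not` and its `mask` parameter are hypothetical; the mask selects the two bit positions that hold the encoded value (3 for the two least significant bits, 5 for the alternative placement mentioned in Sec. 2.2).

```python
def dpl_not(value, mask=0b11):
    """Negate a DPL encoded bit by swapping its two rails with a xor."""
    return value ^ mask

# Two least significant bits: 1 is 01 and 0 is 10.
negated_low = [dpl_not(0b01), dpl_not(0b10)]

# Least and third least significant bits: 1 is 001 and 0 is 100.
negated_alt = [dpl_not(0b001, mask=0b101), dpl_not(0b100, mask=0b101)]
```

In both cases the result is again a valid encoding with exactly one bit set, so the Hamming weight of the value is unchanged by the negation.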

Moreover, we currently run this code transformation only on block ciphers. Given that the code
is supposed to be bitsliced, this means that the branching and arithmetic instructions are either
not used or are used only in a deterministic way (e.g., looping on the round counter)
that does not depend on sensitive information.

Thus, only and, or, and xor instructions need to be expanded to DPL macros such as the one
shown in Fig. 2. This macro has the advantage that it actually uses two-operand instructions only
(when there are three operands in our generic assembly language, the destination is the same as
one of the two others), which makes its instructions mappable one-to-one even with 8-bit assembly
languages.
Figure 4: DPL macro of Fig. 2 in assembly.


Look-up tables. As they appear in the DPL macro, the addresses of look-up tables are sensitive
too. As seen in Sec. 2.2, the look-up tables must be located at an address which is a multiple of 16
so that the last four bits are available when adding the offset (in the case where we use the last four
bits to place the two DPL encoded operands). Tab. 1 presents the 16 values present in the look-up
tables for and, or, and xor.

Values in the look-up tables which are not at DPL valid addresses, i.e., addresses which are not
a concatenation of 01 or 10 with 01 or 10, are preferentially DPL invalid, i.e., 00 or 11. This way,
if an error occurs during the execution (such as a fault injection for instance) it poisons the result
and all the subsequent computations will be faulted too (infective computation).
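The alignment constraint and the infective behavior can be sketched in Python as follows. This is an illustration under assumptions, not the content of Tab. 1: every entry at a DPL invalid index is assumed to hold 00, the base address `BASE` is an arbitrary multiple of 16, and the helper names are hypothetical.

```python
def enc(bit):
    """DPL-encode a logical bit on the two least significant bits."""
    return 0b01 if bit else 0b10

def make_table(f):
    table = [0b00] * 16  # assumption: every DPL invalid index holds 00
    for a in (0, 1):
        for b in (0, 1):
            table[(enc(a) << 2) | enc(b)] = enc(f(a, b))
    return table

def lookup(base, memory, a, b):
    """Fetch memory[base + index] for two 2-bit operands a and b."""
    assert base % 16 == 0            # four low bits of the base are free
    index = (a << 2) | b
    address = base + index
    assert address == base | index   # the addition never carries into the base
    return memory[address]

BASE = 0x40  # hypothetical table address, a multiple of 16
memory = {BASE + i: v for i, v in enumerate(make_table(lambda a, b: a ^ b))}

ok = lookup(BASE, memory, enc(1), enc(0))           # valid operands
faulted = lookup(BASE, memory, 0b11, enc(0))        # a was faulted to 11
propagated = lookup(BASE, memory, faulted, enc(1))  # reuse the faulted result
```

With this filler choice, a faulted operand yields the invalid value 00, and any later look-up that takes this value as an operand yields 00 again, which is the infective behavior described above.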
Table 2: and d a b.

          a, b, d
Before    0, 0, ?       0, 1, ?       1, 0, ?       1, 1, ?
After     0, 0, 0       0, 1, 0       1, 0, 0       1, 1, 1

Table 3: DPL and d a b.

          a, b, d
Before    10, 10, ?     10, 01, ?     01, 10, ?     01, 01, ?
After     10, 10, 10    10, 01, 10    01, 10, 10    01, 01, 01


3.3 Correctness Proof of the Transformation 

Formally proving the correctness of the transformation requires defining what we mean by
“correct”. Intuitively, it means that the transformed code does the “same thing” as the original one.

Definition 3 (Correct DPL transformation). Let $S$ be a valid state of the system (values in
registers and memory). Let $c$ be a sequence of instructions of the system. Let $\hat{S}$ be the state of the
system after the execution of $c$ with state $S$; we denote that by $S \xrightarrow{c} \hat{S}$. We write $dpl(S)$ for the
DPL state (with DPL encoded values of the 1s and 0s in memory and registers) equivalent to the
state $S$.

We say that $c'$ is a correct DPL transformation of $c$ if
$S \xrightarrow{c} \hat{S} \implies dpl(S) \xrightarrow{c'} dpl(\hat{S})$.

Proposition 1 (Correctness of our code transformation). The expansion of the sensitive
instructions into DPL macros such as presented in Sec. 2.2 is a correct DPL transformation.

Proof. Let $a$ and $b$ be instructions. Let $c$ be the code $a; b$ (instruction $a$ followed by instruction $b$).
Let $X$, $Y$, and $Z$ be states of the program. If we have $X \xrightarrow{a} Y$ and $Y \xrightarrow{b} Z$, then we know that
$X \xrightarrow{c} Z$ (by transitivity).

Let $a'$ and $b'$ be the DPL macro expansions of instructions $a$ and $b$. Let $c'$ be the DPL
transformation of code $c$. Since the expansion into macros is done separately for each sensitive instruction,
without any other dependencies, we know that $c'$ is $a'; b'$.

If we have $dpl(X) \xrightarrow{a'} dpl(Y)$ and $dpl(Y) \xrightarrow{b'} dpl(Z)$, then we know that $dpl(X) \xrightarrow{c'} dpl(Z)$.

This means that a chain of correct transformations is a correct transformation. Thus, we only
have to show that the DPL macro expansion is a correct transformation.

Let us start with the and operation. Since the code is bitsliced, there are only four possibilities.
Tab. 2 shows these possibilities for the and d a b instruction.

Tab. 4 shows the evolution of the values of a, b, and d during the execution of the macro which
and d a b expands to. We assume the look-up table for and is located at address and. Tab. 3
sums up Tab. 4 in the same format as Tab. 2.

This proves that the DPL transformation of the and instruction is correct. The demonstration
is similar for the or and xor operations. □
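The commuting property of Def. 3 can also be checked exhaustively on small programs. The Python sketch below is an illustration, not the proof itself: it abstracts each DPL macro as a single table look-up, assumes 00 as filler for unused table entries, and uses hypothetical names (`run_plain`, `run_dpl`, `is_correct`) and a hypothetical three-instruction program.

```python
from itertools import product
import operator

OPS = {"and": operator.and_, "or": operator.or_, "xor": operator.xor}

def enc(bit):
    """DPL-encode a logical bit on the two least significant bits."""
    return 0b01 if bit else 0b10

def make_table(f):
    table = [0b00] * 16  # assumption: invalid filler at unused indexes
    for a in (0, 1):
        for b in (0, 1):
            table[(enc(a) << 2) | enc(b)] = enc(f(a, b))
    return table

TABLES = {name: make_table(f) for name, f in OPS.items()}

def dpl(state):
    """DPL state equivalent to a plain state."""
    return {name: enc(bit) for name, bit in state.items()}

def run_plain(program, state):
    state = dict(state)
    for op, d, a, b in program:
        state[d] = OPS[op](state[a], state[b])
    return state

def run_dpl(program, state):
    state = dict(state)
    for op, d, a, b in program:
        index = ((state[a] & 3) << 2) | (state[b] & 3)
        state[d] = TABLES[op][index]  # macro abstracted as one look-up
    return state

def is_correct(program, names):
    """Check S -c-> S' implies dpl(S) -c'-> dpl(S') for every initial state S."""
    for bits in product((0, 1), repeat=len(names)):
        s = dict(zip(names, bits))
        if run_dpl(program, dpl(s)) != dpl(run_plain(program, s)):
            return False
    return True

single = [("and", "d", "a", "b")]
chain = [("and", "t", "a", "b"), ("xor", "d", "t", "c"), ("or", "d", "d", "a")]
```

Checking `single` mirrors the four cases of Tab. 2 and Tab. 3, while `chain` illustrates the transitivity argument: each instruction is expanded independently, and the composition still commutes with the encoding.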

The automatic DPL transformation of arbitrary assembly code has been implemented in our
tool described in the appendix.

4 Formally Proving the Absence of Leakage 

Now that we know the DPL transformation is correct, we need to prove its effectiveness security-wise.
We prove the absence of leakage on the software, while obviously the leakage heavily depends on
the hardware. Our proof thus makes a hypothesis on the hardware: we suppose that the bits
we use for the positive and negative logic in the DPL protocol leak the same amount. This may
seem like an unreasonable hypothesis, since it is not true in general. However, the protection can
be implemented in a soft CPU core (LatticeMicro32, OpenRISC, LEON2, etc.), that would be
laid out in a Field-Programmable Gate Array (FPGA) or in an Application-Specific Integrated
Circuit (ASIC) with special balancing constraints at place-and-route. The methodology follows the
guidelines given by Chen et al. in [CSS13]. Moreover, we will show in Sec. 4.2 how it is possible,
using stochastic profiling, to find bits whose leakages are similar enough for the DPL countermeasure
to be sufficiently efficient even on non-specialized hardware. That said, it is important to note that
the difference in leakage between two bits of the same register should not be large enough for the
attacker to break the DPL protection using SPA or ASCA.

Formally proving the balance of DPL code requires properly defining the notions we are using.


Definition 4 (Leakage model). The attacker is able to measure the power consumption of parts
of the cryptosystem. We model power consumption by the Hamming distance of value updates,
i.e., the number of bit flips. It is a commonly accepted model for power analysis, for instance
with DPA [KJJ99] or Correlation Power Analysis (CPA) [BCO04]. We write H(a, b) for the Hamming
distance between the values a and b.
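
As a minimal illustration of Def. 4, the Hamming distance H(a, b) can be computed as the Hamming weight of a XOR b. The function names below are ours and are only meant as a sketch of the leakage model, not as code taken from our tool.

```python
def hamming_weight(value: int) -> int:
    """Number of bits set to 1 in a non-negative integer."""
    return bin(value).count("1")


def hamming_distance(a: int, b: int) -> int:
    """H(a, b): number of bit positions in which a and b differ."""
    return hamming_weight(a ^ b)


# Updating a register from 0b0110 to 0b0101 flips the two lowest bits.
print(hamming_distance(0b0110, 0b0101))
```

Under this model, an update that flips more bits is assumed to consume more power, which is what the attacker is assumed to observe.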


Definition 5 (Constant activity). The activity of a cryptosystem is said to be constant if its power 
consumption does not depend on the sensitive data and is thus always the same. 

Formally, let P(s) be a program which has s as parameter (e.g., the key and the plaintext).
According to our leakage model, a program P(s) is of constant activity if:

• for every values s_1 and s_2 of the parameter s, for each cycle i, for every sensitive value v, v
is updated at cycle i in the run of P(s_1) if and only if it is updated also at cycle i in the run
of P(s_2);

• whenever an instruction modifies a sensitive value from v to v', then the value of H(v, v')
does not depend on s.
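
The two conditions above can be sketched as a check over concrete runs. The sketch below assumes a hypothetical `run(s)` function returning, for each cycle, the update performed as a `(location, old, new)` triple; the two toy programs and the DPL encoding used (1 as 0b01, 0 as 0b10) are illustrative assumptions only.

```python
def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def is_constant_activity(run, parameters):
    """Check both conditions of constant activity over all given parameters.

    run(s) is assumed to return one (location, old, new) update per cycle.
    The activity profile records, for each cycle, which location is updated
    (first condition) and with which Hamming distance (second condition).
    """
    profiles = set()
    for s in parameters:
        profile = tuple(
            (cycle, location, hamming_distance(old, new))
            for cycle, (location, old, new) in enumerate(run(s))
        )
        profiles.add(profile)
    return len(profiles) <= 1


def leaky_run(s):
    # Writes the raw secret bit into a register precharged with 0.
    return [("r0", 0, s)]


def balanced_run(s):
    # Writes a two-bit encoding of the secret bit (illustrative encoding).
    return [("r0", 0, 0b01 if s else 0b10)]


print(is_constant_activity(leaky_run, [0, 1]))     # False
print(is_constant_activity(balanced_run, [0, 1]))  # True
```

Enumerating every parameter value is of course not feasible for a real cipher; this only makes the definition concrete on one-bit toy programs.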

Remark 1. The first condition of Def. 5 mostly concerns leakage in the horizontal / time dimension,
while the second condition mostly concerns leakage in the vertical / amplitude dimension.

Remark 2. The first condition of Def. 5 implies that the runs of the program P(s) are constant
in time for every s. This implies that a program of constant activity is not vulnerable to timing
attacks, which is not so surprising given the similarity between SPA and timing attacks.


4.1 Computed Proof of Constant Activity 


To statically determine if the code is correctly balanced (i.e., that the activity of a given program
is constant according to Def. 5), our tool relies on symbolic execution. The idea is to run the code
of the program independently of the sensitive data. This is achieved by computing on sets of all the
possible values instead of values directly. The symbolic execution terminates in our case because
we are using the DPL protection on block ciphers, and we avoid combinatorial explosion thanks
to bitslicing, as a value can initially be only 1 or 0 (or rather their DPL encoded counterparts).
Indeed, bitsliced code only uses logical instructions as explained in Sec. 3.2, which will always return
a result in {0,1} when given two values in {0,1} as arguments.

Our tool implements an interpreter for our generic assembly language which works with sets of
values. The interpreter is equipped to measure all the possible Hamming distances of each value
update, and all the possible Hamming weights of values. It watches updates in registers, in memory,
and also in address buses (since the addresses may leak information when reading in look-up tables).
If for one of these value updates there are different possible Hamming distances or Hamming weights,
then we consider that there is a leak of information: the power consumption activity is not constant
according to Def. 5.

See Tab.

Example. Let a be a register which can initially be either 0 or 1. Let b be a register which can
initially be only 1. The execution of the instruction orr a a b will set the value of a to be all the
possible results of a ∨ b. In this example, the new set of possible values of a will be the singleton
{1} (since 0 ∨ 1 is 1 and 1 ∨ 1 is 1 too). The execution of this instruction only modified one value,
that of a. However, the Hamming distance between the previous value of a and its new value can
be either 0 (in case a was originally 1) or 1 (in case a was originally 0). Thus, we consider that
there is a leak.
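
The example above can be replayed with a small sketch. For simplicity, and as an assumption of this sketch rather than a description of our interpreter, the possible values are represented as a list of joint register assignments, so that the old and new values of a stay correlated when a is both a source and the destination.

```python
from itertools import product


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def initial_states(possible):
    """Expand {register: set of values} into all joint assignments."""
    names = sorted(possible)
    domains = [sorted(possible[name]) for name in names]
    return [dict(zip(names, values)) for values in product(*domains)]


def symbolic_orr(states, dst, src1, src2):
    """Run `orr dst src1 src2` on every possible state.

    Returns the new states and the set of possible Hamming distances
    of the update of dst.
    """
    distances = set()
    next_states = []
    for state in states:
        new_value = state[src1] | state[src2]
        distances.add(hamming_distance(state[dst], new_value))
        updated = dict(state)
        updated[dst] = new_value
        next_states.append(updated)
    return next_states, distances


states = initial_states({"a": {0, 1}, "b": {1}})
states, distances = symbolic_orr(states, "a", "a", "b")
possible_a = {state["a"] for state in states}
leak = len(distances) > 1  # several possible distances: data-dependent activity
print(possible_a, sorted(distances), leak)
```

The new set of values of a is the singleton {1}, yet two Hamming distances (0 and 1) are possible for the update, so the instruction is reported as leaking.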

By running our interpreter on assembly code, we can statically determine if there are leakages
or if the code is perfectly balanced. For instance for a block cipher, we initially set the key and the
plaintext (i.e., the sensitive data) to have all their possible values: all the memory cells containing
the bits of the key and of the plaintext have the value {0,1} (which denotes the set of two elements:
0 and 1). Then the interpreter runs the code and outputs all possible leakages; if none are present,
it means that the code is well balanced. Otherwise we know which instructions caused the leak,
which is helpful for debugging, and also to locate sensitive portions of the code.

For an example in which the code is balanced, we can refer to the execution of the and DPL
macro shown in Tab. There we can see that the Hamming distance of the updates does not
depend on the values of a and b. We also note that at the end of the execution (and actually, all
along the execution) the Hamming weight of each value does not depend on a and b either. This
allows macros to be chained safely: each value is precharged with 0 before being written to.


4.2 Hardware Characterization 


The DPL countermeasure relies on the fact that the pair of bits used to store the DPL encoded
values leak the same way, i.e., that their power consumptions are the same. This property is
generally not true in non-specialized hardware. However, using the two closest bits (in terms
of leakage) for the DPL protocol still helps reach a better immunity to side-channel attacks,
especially ASCAs that operate on a limited number of traces.

The idea is to compute the leakage level of each of the bits during the execution of the algorithm,
in order to choose the two closest ones as the pair to use for the DPL protocol and thus ensure
an optimal balance of the leakage. This is facilitated by the fact that the algorithm is bitsliced.
Indeed, it allows running the whole computation using only a chosen bit while all the others stay
zero. We will see in Sec. 5.1 how we characterized our smartcard in practice.
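
The selection step can be sketched as follows, assuming one scalar leakage level has already been measured per bit. The numbers below are made up for illustration (arbitrary units) and are not measurements from our smartcard; the function name is ours.

```python
from itertools import combinations


def closest_pair(leakage):
    """Return the pair of bit indices whose leakage levels are the closest."""
    return min(
        combinations(range(len(leakage)), 2),
        key=lambda pair: abs(leakage[pair[0]] - leakage[pair[1]]),
    )


# Hypothetical leakage level of bits 0 to 7 (illustrative values only).
levels = [90, 42, 40, 47, 35, 50, 30, 55]
print(closest_pair(levels))
```

In practice the candidate pairs may be restricted further, for instance to keep the look-up tables small, so this unconstrained search is only the simplest form of the idea.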


5 Case Study: PRESENT on an ATmega163 AVR Micro-Controller

5.1 Profiling the ATmega163

We want to limit the size of the look-up tables used by the DPL macros. Thus, DPL macros need
to be able to store two DPL encoded bits in the four consecutive bits of a register. This leaves 13
possible DPL encoding layouts on 8 bits. Writing X for a bit that is used and x otherwise, we have:

1. xxxxxxXX,
2. xxxxxXXx,
3. xxxxXXxx,
4. xxxXXxxx,
5. xxXXxxxx,
6. xXXxxxxx,
7. XXxxxxxx,
8. xxxxxXxX,
9. xxxxXxXx,
10. xxxXxXxx,
11. xxXxXxxx,
12. xXxXxxxx,
13. XxXxxxxx.

As explained in Sec. 4.2, we want to use the pair of bits that have the closest leakage properties,
and that is also the closest to the least significant bit, in order to limit the size of the look-up
tables.

To profile the AVR chip (we are working with an Atmel ATmega163 AVR smartcard, which is
notoriously leaky), we ran eight versions of an unprotected bitsliced implementation of PRESENT,
each of them using only one of the 8 possible bits. We used the Normalized Inter-Class Variance
(NICV) [BDGN14a], also called coefficient of determination, as a metric to evaluate the leakage
level of the variables of each of the 8 versions. Let us denote by L the (noisy and non-injective)
leakage associated with the manipulation of the sensitive value V, both seen as random variables;
then the NICV is defined as the ratio between the inter-class and the total variance of the leakage,
that is: $\mathrm{NICV} = \frac{\mathrm{Var}[\mathbb{E}[L \mid V]]}{\mathrm{Var}[L]}$. By the Cauchy-Schwarz theorem, we have
$0 \leq \mathrm{NICV} \leq 1$; thus the NICV is an absolute leakage metric. A key advantage of NICV is that
it detects leakage using public information like input plaintexts or output ciphertexts only. We used
a fixed key and a variable plaintext on which applying NICV gave us the leakage level of all the
intermediate variables in bijective relation with the plaintext (which are all the sensitive data as
seen in Def.). As we can see on the measures plotted in Fig. (which can be found in App.), the least
significant bit leaks very differently from the others, which are roughly equivalent in terms of
leakage. Thus, we chose to use the xxxxxXXx DPL pattern to avoid the least significant bit (our goal
here is not to use the optimal pair of bits but rather to demonstrate the added value of the
characterization).
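
To make the metric concrete, here is a minimal sketch of an NICV estimator for a single point in time, computed as the variance of the per-class mean leakage divided by the total variance. The function name and the toy data are assumptions for illustration; real traces would contain one such value per sample point.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def nicv(values, leakages):
    """Estimate the NICV: inter-class variance over total variance.

    values[i] is the public class of measurement i (e.g., a plaintext
    nibble) and leakages[i] the measured leakage at the point of interest.
    The total variance must be non-zero.
    """
    classes = defaultdict(list)
    for value, leakage in zip(values, leakages):
        classes[value].append(leakage)
    total_mean = fmean(leakages)
    total_variance = pvariance(leakages, mu=total_mean)
    inter_class_variance = sum(
        len(group) * (fmean(group) - total_mean) ** 2
        for group in classes.values()
    ) / len(leakages)
    return inter_class_variance / total_variance


# Leakage fully determined by the class: NICV is 1.
print(nicv([0, 0, 1, 1], [1.0, 1.0, 3.0, 3.0]))
# Leakage unrelated to the class: NICV is 0.
print(nicv([0, 0, 1, 1], [1.0, 3.0, 1.0, 3.0]))
```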


5.2 Generating Balanced AVR Assembly 

We wrote an AVR bitsliced implementation of PRESENT that uses the S-Box in 14 logic gates from
Courtois et al. [CHM11]. This implementation was translated into our generic assembly language (see
Sec. 3.1). The resulting code was balanced following the method discussed in Sec. 3.2, except that
we used the DPL encoding layout adapted to our particular smartcard, as explained in Sec. 5.1.
App. 0 presents the code of the adapted DPL macro. The balance of the DPL code was then
verified as in Sec. 4.1. Finally, the verified code was mapped back to AVR assembly. All the code
transformations and the verification were done automatically using our tool.


These differences are due to the internal architecture of the chip, for which we don’t have the specifications.

Table 5: DPL cost.

            bitslice    DPL        cost
code (B)    1620        3056       ×1.88
RAM (B)     288         352        +64
#cycles     78,403      235,427    ×3

5.3 Cost of the Countermeasure 

Tab. 5 compares the performances of the DPL protected implementation of PRESENT with
the original bitsliced version from which the protected one has been derived. The DPL
countermeasure multiplies by 1.88 the size of the compiled code. This low factor can be explained
by the numerous instructions which it is not necessary to transform (the whole permutation layer
of the PRESENT algorithm is left as is for instance). The protected version uses 64 more bytes of
memory (sparsely, for the DPL macro look-up tables). It is also only 3 times slower, or 24 times if
we consider that the original bitsliced but unprotected code could operate on 8 blocks at a time.
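
The overhead factors quoted above follow directly from the measured code size, RAM usage, and cycle counts. The short computation below only restates this arithmetic; the variable names are ours.

```python
# Measured costs of the two implementations (code and RAM in bytes).
bitslice = {"code": 1620, "ram": 288, "cycles": 78_403}
dpl = {"code": 3056, "ram": 352, "cycles": 235_427}

code_factor = dpl["code"] / bitslice["code"]       # about 1.886, reported as x1.88
ram_overhead = dpl["ram"] - bitslice["ram"]        # 64 additional bytes
cycle_factor = dpl["cycles"] / bitslice["cycles"]  # about 3.003, reported as x3

# The unprotected bitsliced code can process 8 blocks at once on 8-bit
# registers, whereas the DPL code processes a single block.
per_block_factor = cycle_factor * 8                # about 24

print(code_factor, ram_overhead, cycle_factor, per_block_factor)
```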

Note that these experimental results are only valid for the PRESENT algorithm on the Atmel
ATmega163 AVR device we used. Further work is necessary to compare these results to those which
would be obtained with other algorithms such as Advanced Encryption Standard (AES), and on
other platforms such as ARM processors.


5.4 Attacks 

We attacked three implementations of the PRESENT algorithm: a bitsliced but unprotected one, a
DPL one using the two least significant bits, and a DPL one using two bits that are more balanced
in terms of leakage (as explained in Sec. 5.1). On each of these, we computed the success rate of
a monobit CPA that uses the output of the S-Box as a model. The monobit model is relevant because
only one bit of sensitive data is manipulated at each cycle since the algorithm is bitsliced, and also
because each register is precharged at 0 before a new intermediate value is written to it, as the
DPL protocol prescribes. Note that this means we consider the resistance against first-order attacks
only. Actually, we are precisely in the context of [MOS11], where the efficiency of correlation and
Bayesian attacks gets close as soon as the number of queries required to perform a successful attack
is large enough. This justifies our choice of the CPA for the attack evaluation.

The results are shown in Fig. (which can be found in App. D.2). They demonstrate that the
first DPL implementation is at least 10 times more resistant to first-order power analysis attacks
(requiring almost 1,500 traces) than the unprotected one. The second DPL implementation, which
takes the chip characterization into account, is 34 times more resistant (requiring more than 4,800
traces).

Interpreting these results requires bearing in mind that the attack setting was largely to the
advantage of the attacker. In fact, these results are very pessimistic: we used our knowledge of
the key to select a narrow part of the traces where we knew that the attack would work, and we
used the NICV [BDGN14a] to select the point where the SNR of the CPA attack is the highest
(see similar use cases of NICV in [BDGN14b]). We did this so we could show the improvement in
security due to the characterization of the hardware. Indeed, without this “cheating attacker” (for
the lack of a better term), i.e., when we use a monobit CPA taking into account the maximum
of correlation over the full round, as a normal attacker would do, the unprotected implementation
breaks using about 400 traces (resp. 138 for the “cheating attacker”), while the poorly balanced
one is still not broken using 100,000 traces (resp. about 1,500). We do not have more traces than
that so we can only say that with an experimental SNR of 15 (which is quite large so far), the
security gain is more than 250× and may be much higher with the hardware characterization taken
into account, as our results with the “cheating attacker” show.

Notice that PRESENT is inherently slow in software (optimized non-bitsliced assembly is reported to run in about
11,000 clock cycles on an Atmel ATtiny 45 device [EGG+12]) because it is designed for hardware. Typically, the
permutation layer is free in hardware, but requires many bit-level manipulations in software. Nonetheless, we note
that there are contexts where PRESENT must be supported, but no hardware accelerator is available.

As a comparison, an unprotected AES on the same smartcard breaks in 15 traces, and in 336
traces with a first-order masking scheme using a less powerful attack setting (see success rates of
masking in App. D.1), hence a security gain of 22×. Besides, we notice that our software DPL
protection thwarts ASCAs. Indeed, ASCAs require a high signal-to-noise ratio on a single trace.
This can happen both on unprotected and on masked implementations. However, our protection
aims at theoretically cancelling the leakage, and practically manages to reduce it significantly,
even when the chosen DPL bit pair is not optimal. Therefore, coupling software DPL with
key-update [MSGR10] makes it possible to prevent both fast attacks on few traces (ASCAs) and
attacks that would require more traces (regular CPAs).
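
The security gains reported in this section are ratios between the numbers of traces needed to break the protected and the unprotected implementations. The sketch below recomputes them from the trace counts given in the text; since several counts are only bounds ("almost", "more than"), the ratios are approximate. The function name is ours.

```python
def security_gain(protected_traces, unprotected_traces):
    """Ratio of the number of traces needed with and without protection."""
    return protected_traces / unprotected_traces


# "Cheating attacker" setting: 138 traces break the unprotected code.
print(security_gain(1_500, 138))    # first DPL implementation, about 10.9
print(security_gain(4_800, 138))    # characterized DPL implementation, about 34.8

# Normal attacker: unprotected broken in about 400 traces, DPL not broken
# with 100,000 traces, hence a gain of more than 250.
print(security_gain(100_000, 400))

# First-order masked AES versus unprotected AES on the same smartcard.
print(security_gain(336, 15))       # about 22.4
```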


6 Conclusions and Perspectives 

Contributions. We present a method to protect any bitsliced assembly code by transforming it
to enforce the Dual-rail with Precharge Logic (DPL) protocol, which is a balancing countermeasure
against power analysis. We provide a tool which automates this transformation. We also formally
prove that this transformation is correct, i.e., that it preserves the semantics of the program.

Independently, we show how to formally prove that assembly code is well balanced. Our tool
is also able to use this technique to statically determine whether some arbitrary assembly code’s
power consumption activity is constant, i.e., that it does not depend on the sensitive data. In this
chapter we used the Hamming weight of values and the Hamming distance of value updates as
leakage models for power consumption, but our method is not tied to them and could work with any
other leakage models that are computable. We present how to characterize the targeted hardware
to make use of the resources which maximize the relevancy of our leakage model to run the DPL
protocol.

We then applied our methods, using our tool, to an implementation of the PRESENT cipher on
a real smartcard, which ensured that our methods and models are relevant in practice. In our case
study, the provably balanced DPL protected implementation is at least 250 times more resistant to
power analysis attacks than the unprotected version while being only 3 times slower. These figures
could be better. Indeed, they do not take into account hardware characterization which helps the
balancing a lot, as we were able to see with the “cheating attacker”. Moreover, we have used the
hardware characterization data grossly, only to show the added value of the operation, which as
expected is non-negligible. And of course interpreting our figures requires taking into account that
the ATmega163, the model of smartcard that we had at our disposal, is notoriously leaky.

These results show that software balancing countermeasures are realistic: our formally proved
countermeasure is an order of magnitude less costly than the state of the art of formally proved
masking [RP10].

We insist that the comparison between two security gains is very platform-dependent. The figures we give are
only valid on our specific setup. Of course, for different conditions, e.g., lower signal-to-noise ratio, masking might
become more secure than DPL.

Future work. The first and foremost future work surely is that our methods and tools need to
be further tested in other experimental settings, across more hardware platforms, and using other
cryptographic algorithms.

We did not try to optimize our PRESENT implementation (neither for speed nor space). However, 
automated proofs enable optimization: indeed, the security properties can be checked again after 
any optimization attempt (using proofs computation as non-regression tests, either for changes in 
the DPL transformation method, or for handcrafted optimizations of the generated DPL code). 

Although the mapping from the internal assembly of our tool to the concrete assembly is 
straightforward, it would be better to have a formal correctness proof of the mapping. 

Our work would also benefit from automated bitslicing, which would make it possible to automatically
protect any assembly code with the DPL countermeasure. However, it is still a challenging issue.

Finally, the DPL countermeasure itself could be improved: the pair of bits used for the DPL
protocol could change during the execution, or more simply it could be chosen at random for
each execution in order to better balance the leakage among multiple traces. Besides, unused bits
could be randomized instead of being zero in order to add noise on top of balancing, and thus
reinforce the hypotheses we make on the hardware. An anonymous reviewer of the PROOFS 2014
workshop suggested that randomness could instead be used to mask the intermediate bits. Indeed,
the reviewer thinks that switching bus lines may only increase noise, while masking variables may
provide sound resistance, at least at first order. The resulting method would therefore: 1. gain
both the 1st-order resistance of masking countermeasures and the significant flexibility of
software-defined countermeasures; 2. still benefit from the increase of resistance resulting from the
use of the DPL technique, as demonstrated by the present chapter. This suggestion is of course only
intuitive and lacks argumentation based on precise analysis and calculation.

We believe formal methods have a bright future concerning the certification of side-channel
attack countermeasures (including their implementation in assembly) for trustable cryptosystems.


References 


[BCO04] Eric Brier, Christophe Clavier, and Francis Olivier. Correlation Power Analysis with
a Leakage Model. In CHES, volume 3156 of LNCS, pages 16-29. Springer, August
11-13 2004. Cambridge, MA, USA.

[BDGN14a] Shivam Bhasin, Jean-Luc Danger, Sylvain Guilley, and Zakaria Najm. NICV: Normalized
Inter-Class Variance for Detection of Side-Channel Leakage. In International
Symposium on Electromagnetic Compatibility (EMC ’14 / Tokyo). IEEE, May 12-16
2014. Session OS09: EM Information Leakage. Hitotsubashi Hall (National Center of
Sciences), Chiyoda, Tokyo, Japan.

[BDGN14b] Shivam Bhasin, Jean-Luc Danger, Sylvain Guilley, and Zakaria Najm. Side-channel
Leakage and Trace Compression Using Normalized Inter-class Variance. In Proceedings
of the Third Workshop on Hardware and Architectural Support for Security and
Privacy, HASP ’14, pages 7:1-7:9, New York, NY, USA, 2014. ACM.

[BG13] Alberto Battistello and Christophe Giraud. Fault Analysis of Infective AES Computations.
In Wieland Fischer and Jörn-Marc Schmidt, editors, 2013 Workshop on Fault
Diagnosis and Tolerance in Cryptography, Los Alamitos, CA, USA, August 20, 2013,
pages 101-107. IEEE, 2013. Santa Barbara, CA, USA.

[Bih97] Eli Biham. A Fast New DES Implementation in Software. In Eli Biham, editor, FSE,
volume 1267 of Lecture Notes in Computer Science, pages 260-272. Springer, 1997.

[BKL+07] Andrey Bogdanov, Lars R. Knudsen, Gregor Leander, Christof Paar, Axel Poschmann,
Matthew J. B. Robshaw, Yannick Seurin, and Charlotte Vikkelsoe. PRESENT: An
Ultra-Lightweight Block Cipher. In CHES, volume 4727 of LNCS, pages 450-466.
Springer, September 10-13 2007. Vienna, Austria.

[CESY14] Cong Chen, Thomas Eisenbarth, Aria Shahverdi, and Xin Ye. Balanced Encoding to
Mitigate Power Analysis: A Case Study. In CARDIS, Lecture Notes in Computer
Science. Springer, November 2014. Paris, France.

[CFGR12] Claude Carlet, Jean-Charles Faugère, Christopher Goyet, and Guénaël Renault. Analysis
of the algebraic side channel attack. J. Cryptographic Engineering, 2(1):45-62,
2012.

[CGP+12] Claude Carlet, Louis Goubin, Emmanuel Prouff, Michael Quisquater, and Matthieu
Rivain. Higher-Order Masking Schemes for S-Boxes. In Anne Canteaut, editor, Fast
Software Encryption - 19th International Workshop, FSE 2012, Washington, DC,
USA, March 19-21, 2012. Revised Selected Papers, volume 7549 of Lecture Notes in
Computer Science, pages 366-384. Springer, 2012.

[CHM11] Nicolas Courtois, Daniel Hulme, and Theodosis Mourouzis. Solving Circuit Optimisation
Problems in Cryptography and Cryptanalysis. IACR Cryptology ePrint Archive,
2011:475, 2011. (Also presented in SHARCS 2012, Washington DC, 17-18 March 2012,
on page 179).

[Con13] Common Criteria Consortium. Common Criteria (aka CC) for Information Technology
Security Evaluation (ISO/IEC 15408), 2013.
Website: http://www.commoncriteriaportal.org/.

[CPR07] Jean-Sébastien Coron, Emmanuel Prouff, and Matthieu Rivain. Side Channel Cryptanalysis
of a Higher Order Masking Scheme. In Pascal Paillier and Ingrid Verbauwhede,
editors, CHES, volume 4727 of LNCS, pages 28-44. Springer, 2007.

[CSS13] Zhimin Chen, Ambuj Sinha, and Patrick Schaumont. Using Virtual Secure Circuit
to Protect Embedded Software from Side-Channel Attacks. IEEE Trans. Computers,
62(1):124-136, 2013.

[DFK+13] Goran Doychev, Dominik Feld, Boris Köpf, Laurent Mauborgne, and Jan Reineke.
CacheAudit: A Tool for the Static Analysis of Cache Side Channels. IACR Cryptology
ePrint Archive, 2013:253, 2013.

[EGG+12] Thomas Eisenbarth, Zheng Gong, Tim Güneysu, Stefan Heyse, Sebastiaan Indesteege,
Stephanie Kerckhof, François Koeune, Tomislav Nad, Thomas Plos, Francesco Regazzoni,
François-Xavier Standaert, and Loïc van Oldeneel tot Oldenzeel. Compact
Implementation and Performance Evaluation of Block Ciphers in ATtiny Devices. In
Aikaterini Mitrokotsa and Serge Vaudenay, editors, AFRICACRYPT, volume 7374 of
Lecture Notes in Computer Science, pages 172-187. Springer, 2012.

[GCS+08] Sylvain Guilley, Sumanta Chaudhuri, Laurent Sauvage, Philippe Hoogvorst, Renaud
Pacalet, and Guido Marco Bertoni. Security Evaluation of WDDL and SecLib
Countermeasures against Power Attacks. IEEE Transactions on Computers, 57(11):1482-1497,
nov 2008.

[GHMP05] Sylvain Guilley, Philippe Hoogvorst, Yves Mathieu, and Renaud Pacalet. The “Backend
Duplication” Method. In CHES, volume 3659 of LNCS, pages 383-397. Springer,
2005. August 29th - September 1st, Edinburgh, Scotland, UK.

[GM11] Tim Güneysu and Amir Moradi. Generic side-channel countermeasures for reconfigurable
devices. In Bart Preneel and Tsuyoshi Takagi, editors, CHES, volume 6917 of
LNCS, pages 33-48. Springer, 2011.

[HDD11] Philippe Hoogvorst, Jean-Luc Danger, and Guillaume Duc. Software Implementation
of Dual-Rail Representation. In COSADE, February 24-25 2011. Darmstadt, Germany.

[IPSW06] Yuval Ishai, Manoj Prabhakaran, Amit Sahai, and David Wagner. Private Circuits II:
Keeping Secrets in Tamperable Circuits. In EUROCRYPT, volume 4004 of Lecture
Notes in Computer Science, pages 308-327. Springer, May 28 - June 1 2006. St.
Petersburg, Russia.

[ISW03] Yuval Ishai, Amit Sahai, and David Wagner. Private Circuits: Securing Hardware
against Probing Attacks. In CRYPTO, volume 2729 of Lecture Notes in Computer
Science, pages 463-481. Springer, August 17-21 2003. Santa Barbara, California,
USA.

[KB07] Boris Köpf and David A. Basin. An information-theoretic model for adaptive
side-channel attacks. In Peng Ning, Sabrina De Capitani di Vimercati, and Paul F. Syverson,
editors, ACM Conference on Computer and Communications Security, pages 286-296.
ACM, 2007.

[KD09] Boris Köpf and Markus Dürmuth. A provably secure and efficient countermeasure
against timing attacks. In CSF, pages 324-335. IEEE Computer Society, 2009.

[KJJ96] Paul C. Kocher, Joshua Jaffe, and Benjamin Jun. Timing Attacks on Implementations
of Diffie-Hellman, RSA, DSS, and Other Systems. In Proceedings of CRYPTO’96,
volume 1109 of LNCS, pages 104-113. Springer-Verlag, 1996.

[KJJ99] Paul C. Kocher, Joshua Jaffe, and Benjamin Jun. Differential Power Analysis. In
Proceedings of CRYPTO’99, volume 1666 of LNCS, pages 388-397. Springer-Verlag,
1999.

[MAM+03] Simon Moore, Ross Anderson, Robert Mullins, George Taylor, and Jacques J.A.
Fournier. Balanced Self-Checking Asynchronous Logic for Smart Card Applications.
Journal of Microprocessors and Microsystems, 27(9):421-430, October 2003.

[MO12] Luke Mather and Elisabeth Oswald. Pinpointing side-channel information leaks in
web applications. J. Cryptographic Engineering, 2(3):161-177, 2012.

[MOP06] Stefan Mangard, Elisabeth Oswald, and Thomas Popp. Power Analysis Attacks: Revealing
the Secrets of Smart Cards. Springer, December 2006. ISBN 0-387-30857-1,
http://www.dpabook.org/.

[MOPT12] Andrew Moss, Elisabeth Oswald, Dan Page, and Michael Tunstall. Compiler Assisted
Masking. In Emmanuel Prouff and Patrick Schaumont, editors, CHES, volume 7428
of LNCS, pages 58-75. Springer, 2012.

[MOS11] Stefan Mangard, Elisabeth Oswald, and François-Xavier Standaert. One for All - All
for One: Unifying Standard DPA Attacks. Information Security, IET, 5(2):100-111,
2011. ISSN: 1751-8709; Digital Object Identifier: 10.1049/iet-ifs.2010.0096.

[MS06] Stefan Mangard and Kai Schramm. Pinpointing the Side-Channel Leakage of Masked
AES Hardware Implementations. In CHES, volume 4249 of LNCS, pages 76-90.
Springer, October 10-13 2006. Yokohama, Japan.

[MSGR10] Marcel Medwed, François-Xavier Standaert, Johann Großschädl, and Francesco Regazzoni.
Fresh Re-Keying: Security against Side-Channel and Fault Attacks for Low-Cost
Devices. In AFRICACRYPT, volume 6055 of LNCS, pages 279-296. Springer, May
03-06 2010. Stellenbosch, South Africa. DOI: 10.1007/978-3-642-12678-9_17.

[NBD+10] Maxime Nassar, Shivam Bhasin, Jean-Luc Danger, Guillaume Duc, and Sylvain Guilley.
BCDL: A high performance balanced DPL with global precharge and without
early-evaluation. In DATE’10, pages 849-854. IEEE Computer Society, March 8-12
2010. Dresden, Germany.

[PM05] Thomas Popp and Stefan Mangard. Masked Dual-Rail Pre-charge Logic: DPA-Resistance
Without Routing Constraints. In Josyula R. Rao and Berk Sunar, editors,
Cryptographic Hardware and Embedded Systems - CHES 2005, volume 3659 of LNCS,
pages 172-186. Springer, 2005.

[RP10] Matthieu Rivain and Emmanuel Prouff. Provably Secure Higher-Order Masking of
AES. In Stefan Mangard and François-Xavier Standaert, editors, CHES, volume 6225
of LNCS, pages 413-427. Springer, 2010.

[RS09] Mathieu Renauld and François-Xavier Standaert. Algebraic Side-Channel Attacks. In
Feng Bao, Moti Yung, Dongdai Lin, and Jiwu Jing, editors, Inscrypt, volume 6151 of
Lecture Notes in Computer Science, pages 393-410. Springer, 2009.

[RSVC09] Mathieu Renauld, François-Xavier Standaert, and Nicolas Veyrat-Charvillon. Algebraic
Side-Channel Attacks on the AES: Why Time also Matters in DPA. In CHES,
volume 5747 of Lecture Notes in Computer Science, pages 97-111. Springer, September
6-9 2009. Lausanne, Switzerland.

[SBG+09] Nidhal Selmane, Shivam Bhasin, Sylvain Guilley, Tarik Graba, and Jean-Luc Danger.
WDDL is Protected Against Setup Time Violation Attacks. In FDTC, pages 73-83.
IEEE Computer Society, September 6th 2009. In conjunction with CHES’09,
Lausanne, Switzerland. DOI: 10.1109/FDTC.2009.40; Online version:
http://hal.archives-ouvertes.fr/hal-00410135/en/.

[SDMB14] Victor Servant, Nicolas Debande, Houssem Maghrebi, and Julien Bringer. Study of
a Novel Software Constant Weight Implementation. In CARDIS, Lecture Notes in
Computer Science. Springer, November 2014. Paris, France.

[SEE98] Maitham Shams, Jo. C. Ebergen, and Mohamed I. Elmasry. Modeling and comparing
CMOS implementations of the C-Element. IEEE Transactions on VLSI Systems,
6(4):563-567, December 1998.

[SP06] Kai Schramm and Christof Paar. Higher Order Masking of the AES. In David
Pointcheval, editor, CT-RSA, volume 3860 of LNCS, pages 208-225. Springer, 2006.

[TPR13] Adrian Thillard, Emmanuel Prouff, and Thomas Roche. Success through Confidence:
Evaluating the Effectiveness of a Side-Channel Attack. In Guido Bertoni and
Jean-Sébastien Coron, editors, CHES, volume 8086 of Lecture Notes in Computer Science,
pages 21-36. Springer, 2013.

[TV04a] Kris Tiri and Ingrid Verbauwhede. A Logic Level Design Methodology for a Secure
DPA Resistant ASIC or FPGA Implementation. In DATE’04, pages 246-251. IEEE
Computer Society, February 2004. Paris, France. DOI: 10.1109/DATE.2004.1268856.

[TV04b] Kris Tiri and Ingrid Verbauwhede. Place and Route for Secure Standard Cell Design.
In Kluwer, editor, Proceedings of WCC / CARDIS, pages 143-158, Aug 2004.
Toulouse, France.

[TV06] Kris Tiri and Ingrid Verbauwhede. A digital design flow for secure integrated circuits.
IEEE Trans. on CAD of Integrated Circuits and Systems, 25(7):1197-1208, 2006.

[ZJRR12] Yinqian Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. Cross-VM
side channels and their use to extract private keys. In Ting Yu, George Danezis,
and Virgil D. Gligor, editors, ACM Conference on Computer and Communications
Security, pages 305-316. ACM, 2012.



A paioli 

The goal of paioli (Power Analysis Immunity by Offsetting Leakage Intensity) is to protect
assembly code against power analysis attacks such as DPA (differential power analysis) and CPA
(correlation power analysis), and to formally prove the efficiency of the protection. To this end,
it implements the automatic insertion of a balancing countermeasure, namely DPL (dual-rail with
precharge logic), in assembly code (for now limited to bitsliced block-cipher type of algorithms).
Independently, it is able to statically verify if the power consumption of a given assembly code
is correctly balanced with regard to a leakage model (e.g., the Hamming weight of values, or the
Hamming distance of value updates).


paioli [options] <input-file>

  -bf  Bit to use as F is DPL macros (default: 1)
  -bt  Bit to use as T is DPL macros (default: 0)
  -po  Less significant bit of the DPL pattern for DPL LUT access (default: 0)
  -cl  Compact the DPL look-up table (LUT) if present
  -la  Address in memory where to put the DPL LUT (default: 0)
  -r1  Register number of one of the three used by DPL macros (default: 20)
  -r2  Register number of one of the three used by DPL macros (default: 21)
  -r3  Register number of one of the three used by DPL macros (default: 22)
  -a   Adapter for custom assembly language
  -o   asm output (default: no output)
  -1   Only check syntax if present
  -d   Perform DPL transformation of the code if present
  -V   Perform leakage verification if present
  -s   Perform simulation if present
  -r   Register count for simulation (default: 32)
  -m   Memory size for simulation (default: 1024)
  -M   range of memory to display after simulation
  -R   range of registers to display after simulation

The rest of this section details its features. 


Adapters. To adapt easily to any assembly language, paioli has a system of plugins (which we call 
“adapters”). An adapter provides a parser and a pretty-printer for a given language, which are used 
instead of the internal parser and pretty-printer (made for the internal language we use, see 
Sec. 3.1) without having to recompile the whole tool. 


* http://pablo.rauzy.name/sensi/paioli.html 

DPL transformation. On request, paioli automatically applies the DPL transformation 
as explained in Sec. 3.2. It takes as arguments which bits to use for the DPL protocol, the offset at 
which to place the pattern for look-up tables (for example, we used an offset of 1 to avoid resorting 
to the least significant bit, which leaks differently), and where in memory the look-up tables should 
start. Given these parameters, the tool verifies that they are valid and consistent according to 
the DPL protocol, and then generates the DPL balanced code corresponding to the input code, 
including the code for look-up tables initialization. Optionally, the tool can compact the look-up 
tables (since they are sparse), while still making sure that their addresses respect the DPL protocol. 
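
To make the role of these parameters concrete, here is a minimal Python sketch of the DPL encoding and of the kind of consistency check the tool could perform on them. The parameter names mirror the `-bt`, `-bf`, `-po` and `-la` options; the function names and the specific checks in `check_dpl_params` are assumptions made for illustration, and paioli's actual checks may differ.

```python
def dpl_encode(bit: int, bt: int = 0, bf: int = 1) -> int:
    """DPL-encode a logical bit: set bit `bt` for 1 and bit `bf` for 0."""
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    return 1 << (bt if bit else bf)


def dpl_mask(bt: int = 0, bf: int = 1) -> int:
    """Mask selecting the two DPL bits of a register."""
    return (1 << bt) | (1 << bf)


def check_dpl_params(bt: int, bf: int, po: int, lut_addr: int, width: int = 8) -> None:
    """Hypothetical consistency checks (assumed, not taken from paioli)."""
    if not (0 <= bt < width and 0 <= bf < width):
        raise ValueError("DPL bits must fit in a register")
    if abs(bt - bf) != 1:
        raise ValueError("this sketch assumes two adjacent DPL bits")
    if po < 0 or po + 4 > width:
        raise ValueError("the 4-bit LUT pattern must fit in a register")
    if lut_addr & (0b1111 << po):
        raise ValueError("LUT address must be zero on the pattern bits")


# Settings of Appendix C: bits 1 and 2, LUT pattern starting at bit 1.
assert dpl_encode(1, bt=1, bf=2) == 2
assert dpl_encode(0, bt=1, bf=2) == 4
assert dpl_mask(bt=1, bf=2) == 6
check_dpl_params(bt=1, bf=2, po=1, lut_addr=0)
```

With the default bits 0 and 1 the same functions give the encodings 1 and 2 and the mask 3.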


Simulation. On request, paioli can simulate the execution of the code after its optional DPL 
transformation. The simulator is equipped to do the balance verification proof, but running the 
balance analysis is not mandatory. It takes as parameters the size of the memory and the number 
of registers to use, and initializes them to the set of two DPL encoded values of 1 and 0 
corresponding to the given DPL parameters. The tool can optionally display the content of selected 
portions of the memory or of chosen registers after execution, which is useful for inspection and 
debugging. 


Balance verification. The formal verification of the balance of the code is an essential 
functionality of the tool. Indeed, bugs occur even with a thorough and comprehensive specification. 
We therefore believe that it is not sufficient to have a precise and formally proven method for 
generating protected code: the results should also be independently verified. 
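
The idea of an independent balance check can be sketched on a toy model. The Python code below enumerates every input of a tiny "program" (modelled as the list of values successively written to one register) and tests whether the Hamming-distance leakage is the same for all of them. This brute-force enumeration, the register-history model and all names are assumptions for illustration; it stands in for, and is much weaker than, the static analysis performed by the tool on real assembly code.

```python
from itertools import product


def hamming_weight(value: int) -> int:
    """Number of bits set in a non-negative integer."""
    return bin(value).count("1")


def leakage_trace(values):
    """Hamming-distance leakage of successive writes to one register."""
    return [hamming_weight(a ^ b) for a, b in zip(values, values[1:])]


def is_balanced(program, n_inputs: int) -> bool:
    """True if `program` produces the same leakage trace for every input."""
    traces = {tuple(leakage_trace(program(*bits)))
              for bits in product((0, 1), repeat=n_inputs)}
    return len(traces) == 1


def unprotected_and(a, b):
    # Register history: initial content 0, then a, then a AND b.
    return [0, a, a & b]


def dpl_load(a):
    # Precharge to 0, then write the DPL encoding of a (1 -> 0b01, 0 -> 0b10).
    return [0, 0b01 if a else 0b10]


assert not is_balanced(unprotected_and, 2)  # leakage depends on a and b
assert is_balanced(dpl_load, 1)             # one bit flips whatever a is
```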


B Characterization of the Atmel ATmega163 AVR Micro-Controller 


Fig. 5 shows the leakage level computed using NICV [BDGN14a] for each bit of the Atmel 
ATmega163 AVR smartcard that we used for our tests (see Sec. 5.1). We can see that the first bit 
(the least significant one) leaks very differently from the others. Thus it is not a good candidate 
to appear in the bit pair used for the DPL protocol. 



Figure 5: Leakage during unprotected encryption for each bit on ATmega163 (one plot per bit, 
from bit 0 to bit 7; horizontal axis: time, restarting for each bit). 


C DPL Macro for the AVR Micro-Controller 


Figure 7: Attacking AES on the ATmega163: success rates. 
(a) Univariate CPA attack on unprotected AES. 
(b) Bi-variate 2O-CPA on 1st-order protected AES. 

Once we profiled our smartcard as described in Sec. 5.1, we decided to use bits 1 and 2 for the 
DPL protocol (xxxxxXXx), that is, the DPL value of 1 becomes 2 and the DPL value of 0 becomes 
4. To avoid using the least significant bit (which leaks very differently from the others), we decided 
to align the two DPL bits for look-up table access starting on bit 1 rather than 0 (xxxXXXXx). 
With these settings, the DPL macro automatically generated by paioli is presented in Fig. 6 (it 
follows the same conventions as the earlier DPL macro figure). As we can see, the only modification 
is the mask applied in the logical and instructions, which is now 6 instead of 3 to reflect the new 
DPL pattern. 

Note that the least significant bit is now unused by the DPL protocol, which allowed paioli to 
compact the look-up tables used by the DPL macros. Indeed, their addresses need to be of the form 
/.+0000./, leaving the least significant bit free and thus allowing two look-up tables to be 
interleaved without any overlap of their actually used cells (see Sec. 3.2). 
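
The effect of these settings on look-up table addressing can be checked with a short Python sketch. It assumes that the table offset is built by masking the DPL bits of each operand and placing the first operand's pair two bits above the second's, which is consistent with the xxxXXXXx pattern but is not spelled out in this appendix; the names `lut_offset` and `cells` are hypothetical.

```python
MASK = 0b110                 # DPL bits 1 and 2 (mask 6)
ENC = {1: 0b010, 0: 0b100}   # DPL value of 1 is 2, DPL value of 0 is 4


def lut_offset(a: int, b: int, mask: int = MASK) -> int:
    """Offset of the cell for DPL-encoded operands a and b (pattern xxxXXXXx)."""
    return ((a & mask) << 2) | (b & mask)


# Only four of the cells spanned by the pattern are ever accessed.
USED_OFFSETS = sorted(lut_offset(ENC[x], ENC[y]) for x in (0, 1) for y in (0, 1))


def cells(base: int) -> set:
    """Cells used by a table at `base`; bits 1 to 4 of `base` must be zero."""
    if base & 0b11110:
        raise ValueError("base address must be zero on the pattern bits")
    return {base + off for off in USED_OFFSETS}


# Every used offset is even, so the least significant bit stays free ...
assert all(off % 2 == 0 for off in USED_OFFSETS)
# ... and a second table placed one address higher never collides with the first.
assert cells(0b000000).isdisjoint(cells(0b000001))
```

Both base addresses used in the example have zeros on bits 1 to 4 and differ only in the least significant bit, which is what makes the interleaving possible.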


D Attacks 

D.1 Attack results on masking (AES) 

Figure 6: DPL macro for d = a op b on the ATmega163. 

For the sake of comparison, we provide attack results on the same smartcard tested with the same 
setup. Figure 7 shows the success rate for the attack on the first byte of an AES. 

We estimate the number of traces for a successful attack as the abscissa where the success rate 
curve first intersects the 80% horizontal line. 
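
The 80% criterion can be written down as a small Python helper. The linear interpolation between measured points and the function name are assumptions made for illustration, and the numbers in the usage line are synthetic, not measurements from this paper.

```python
def traces_to_success(trace_counts, success_rates, threshold=0.8):
    """Abscissa where the success-rate curve first reaches `threshold`.

    The curve is assumed piecewise linear between measured points.
    Returns None if the threshold is never reached.
    """
    prev_n = prev_sr = None
    for n, sr in zip(trace_counts, success_rates):
        if sr >= threshold:
            if prev_n is None:
                return float(n)  # already above threshold at the first point
            # prev_sr < threshold <= sr here, so the denominator is positive.
            frac = (threshold - prev_sr) / (sr - prev_sr)
            return prev_n + frac * (n - prev_n)
        prev_n, prev_sr = n, sr
    return None


# Synthetic curve: the crossing lies halfway between 200 and 300 traces.
estimate = traces_to_success([100, 200, 300], [0.2, 0.6, 1.0])
```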


D.2 Attack results on DPL (PRESENT) 

Fig. 9 shows the success rates and the correlation curves when attacking our three implementations 
of PRESENT. The sensitive variable we consider is in line with the choice of Kocher et al. in their 
CRYPTO’99 paper [KJJ99]: it is the least significant bit of the output of the substitution boxes 
(which are 4 × 4 in PRESENT). 

A companion figure gives, for the unprotected bitslice implementation, the correspondence between 
the operations of PRESENT and the NICV trace. The zones of largest NICV correspond to operations 
that access (read or write) sensitive data in RAM. To make the attacks more powerful, they are not 
done on the maximal correlation point over the full first round of PRESENT* (500,000 samples), 
but rather on a smaller interval (of only 140 samples, i.e., one clock period of the device) of high 
potential leakage revealed by the NICV computations, namely sBoxLayer. 

This makes the attack much more powerful and has to be taken into account when interpreting 
its results. In fact, the results we present are very pessimistic: we used our knowledge of the key 
to select a narrow part of the traces where we knew that the attack would work, and we used the 
NICV [BDGN14a] to select the point where the SNR of the CPA attack is the highest. We did this 
so we could show the improvement in security due to the characterization of the hardware. Indeed, 
without this “cheating attacker” (for lack of a better term), i.e., when we use a monobit CPA 
taking into account the maximum of correlation over the full round, as a normal attacker would 
do, the unprotected implementation breaks using about 400 traces (resp. 138 for the “cheating 
attacker”), while the poorly balanced one is still not broken using 100,000 traces (resp. about 
1,500). We do not have more traces than that, so we can only say that with an experimental SNR 
of 15 (which is quite large so far), the security gain is more than 250× and may be much higher 
with the hardware characterization taken into account, as our results with the “cheating attacker” 
show. Another way of understanding the 250-fold data complexity increase for the CPA is to turn 
this figure into a reduction of the SNR: according to [TPR13, BDGN14b], our DPL countermeasure 
has attenuated the SNR by a factor of at least √250 ≈ 16. 

* Note that using the maximum correlation point to attack the DPL implementations resulted in the 
success rate remaining always at ≈ 1/16 on average (there are 2^4 key guesses in PRESENT when 
targeting the first round, because the substitution boxes are 4 × 4), at least for the number of 
traces we had (100,000), on both of them. 

Figure 9: Attacks on our three implementations of PRESENT. 
Left: success rates (estimated with 100 attacks/step), and 
Right: CPA curves (whole first round in (a), and only sBoxLayer for (b) and (c)). 
(a) Monobit CPA attack on unprotected bitslice implementation. 
(b) Monobit CPA attack on poorly balanced DPL implementation (bits 0 and 1). 
(c) Monobit CPA attack on better balanced DPL implementation (bits 1 and 2). 
CPA plots: correlation for all 16 guesses (correct one in black) against time (number of samples, 
×1000); the plot titles read “after 400 traces” and “after 9000 traces”. 
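
The arithmetic behind the 250-fold gain and the SNR attenuation quoted in Sec. D.2 can be reproduced in a few lines of Python. The sketch assumes, as the √250 figure implies, that the number of traces needed grows with the inverse square of the SNR; the variable names are illustrative.

```python
import math

traces_unprotected = 400          # full-round monobit CPA, unprotected code
traces_dpl_lower_bound = 100_000  # poorly balanced DPL code, still unbroken

# Lower bound on the data-complexity gain of the countermeasure.
gain = traces_dpl_lower_bound / traces_unprotected

# Assumed relation: traces needed ~ 1 / SNR^2, hence the square root.
snr_attenuation = math.sqrt(gain)

print(f"gain >= {gain:.0f}x, SNR attenuated by >= {snr_attenuation:.1f}")
```

Since the DPL implementation was not broken with the available traces, both numbers are lower bounds.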

